Dynamic state representation learning is an important task in robot learning. Latent spaces that capture dynamics information have wide applications in areas such as accelerating model-free reinforcement learning, closing the simulation-to-reality gap, and reducing the complexity of motion planning. However, current dynamic state representation learning methods scale poorly on complex dynamic systems such as deformable objects, and cannot directly embed well-defined simulation functions into the training pipeline. We propose DiffSRL, a dynamic state representation learning pipeline that leverages differentiable simulation to embed complex dynamics models as part of end-to-end training. We also integrate differentiable dynamics constraints as part of the pipeline, which provides incentives for the latent state to be aware of dynamical constraints. We further establish a state representation learning benchmark on the soft-body simulation system PlasticineLab, on which our model demonstrates superior performance in capturing long-term dynamics as well as reward prediction.
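The key property DiffSRL relies on is that gradients can flow through the simulator itself. A toy illustration of that idea, far simpler than the paper's soft-body simulator: a hand-linearized point-mass rollout whose final position is differentiable in the control, so the control can be fit by plain gradient descent through the rollout. All names and constants here are our own illustrative choices, not from the paper.

```python
def simulate(pos, vel, u, dt=0.1, steps=20):
    """Toy differentiable rollout of a point mass under constant control u.

    The final position is linear in u, so its gradient has a closed form.
    """
    for _ in range(steps):
        vel = vel + dt * u
        pos = pos + dt * vel
    return pos

dt, steps = 0.1, 20
# d(final pos)/du, derived by hand: dt^2 * (1 + 2 + ... + steps)
dpos_du = dt ** 2 * steps * (steps + 1) / 2

# Gradient descent *through the simulator* on a reaching objective.
target, u, lr = 3.0, 0.0, 0.05
for _ in range(100):
    err = simulate(0.0, 0.0, u) - target
    u -= lr * 2.0 * err * dpos_du
```

A non-differentiable simulator would force sampling-based estimates of the same gradient; differentiability is what lets DiffSRL embed much richer dynamics into end-to-end representation learning.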
translated by 谷歌翻译
We propose a novel teacher-student model for semi-supervised multi-organ segmentation. In teacher-student models, data augmentation is usually applied to unlabeled data to regularize consistent training between teacher and student. We start from the key observation that the fixed relative locations and variable sizes of different organs provide distribution information about where a multi-organ CT scan is drawn from. Thus, we treat the prior anatomy as a strong tool to guide data augmentation and reduce the mismatch between labeled and unlabeled images in semi-supervised learning. More specifically, we propose a data augmentation strategy based on partition-and-recovery of $N^3$ cubes, applied across and within labeled and unlabeled images. Our strategy encourages unlabeled images to learn organ semantics in relative locations from the labeled images (cross-branch) and enhances the learning ability for small organs (within-branch). For the within-branch, we further propose to refine the quality of pseudo labels by blending the learned representations from small cubes to incorporate local attributes. Our method is termed MagicNet, since it treats the CT volume as a magic cube and the $N^3$-cube partition-and-recovery process matches the rules of playing with a magic cube. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, which noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with a +7% DSC improvement on the MACT dataset with 10% labeled images.
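The partition-and-recovery operation at the core of MagicNet can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (a cubic volume whose side is divisible by $N$); function names are ours, not from the paper's code.

```python
import numpy as np

def partition_cubes(volume, n):
    """Split a cubic volume into n**3 equally sized sub-cubes.

    Assumes each spatial dimension is divisible by n.
    """
    d = volume.shape[0] // n
    return [
        volume[i * d:(i + 1) * d, j * d:(j + 1) * d, k * d:(k + 1) * d]
        for i in range(n) for j in range(n) for k in range(n)
    ]

def recover_volume(cubes, n):
    """Reassemble the n**3 sub-cubes back into the original volume."""
    d = cubes[0].shape[0]
    out = np.empty((n * d,) * 3, dtype=cubes[0].dtype)
    idx = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i * d:(i + 1) * d, j * d:(j + 1) * d, k * d:(k + 1) * d] = cubes[idx]
                idx += 1
    return out

vol = np.arange(4 ** 3).reshape(4, 4, 4)
cubes = partition_cubes(vol, 2)   # 8 sub-cubes of shape (2, 2, 2)
```

In MagicNet the interesting step happens between these two calls: cubes are mixed across labeled and unlabeled scans before recovery, so relative organ locations transfer between branches.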
We propose EM-PASTE: an Expectation-Maximization (EM) guided Cut-Paste compositional dataset augmentation approach for weakly-supervised instance segmentation using only image-level supervision. The proposed method consists of three main components. The first component generates high-quality foreground object masks. To this end, an EM-like approach is proposed that iteratively refines an initial set of object mask proposals generated by a generic region proposal method. Next, in the second component, high-quality context-aware background images are generated using a text-to-image compositional synthesis method like DALL-E. Finally, the third component creates a large-scale pseudo-labeled instance segmentation training dataset by compositing the foreground object masks onto the original and generated background images. The proposed approach achieves state-of-the-art weakly-supervised instance segmentation results on both the PASCAL VOC 2012 and MS COCO datasets by using only image-level, weak label information. In particular, it outperforms the best baseline by +7.4 and +2.8 mAP0.50 on PASCAL and COCO, respectively. Further, the method provides a new solution to the long-tail weakly-supervised instance segmentation problem (when many classes have only a few training samples), by selectively augmenting under-represented classes.
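The third component's compositing step is essentially masked paste. A minimal sketch with NumPy, assuming an object crop plus a boolean mask of the same spatial size; the function name and array shapes are our own illustrative choices.

```python
import numpy as np

def paste_foreground(background, foreground, mask, top, left):
    """Composite a masked foreground crop onto a background image.

    `mask` is a boolean array matching `foreground`'s spatial shape;
    the paste position (top, left) must keep the crop inside the canvas.
    """
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    region[mask] = foreground[mask]   # only masked pixels are transferred
    return out

bg = np.zeros((8, 8, 3), dtype=np.uint8)          # stand-in background
fg = np.full((4, 4, 3), 255, dtype=np.uint8)      # stand-in object crop
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                             # object occupies the interior
composite = paste_foreground(bg, fg, mask, top=2, left=2)
```

The pasted mask location directly yields the pseudo instance label for the composite, which is what makes the generated dataset trainable without human annotation.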
We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful designs of the optimization dynamics are critical to learning meaningful representations. We identify that a faster paced optimization of the predictor and semi-gradient updates on the representation, are crucial to preventing the representation collapse. Then in an idealized setup, we show self-predictive learning dynamics carries out spectral decomposition on the state transition matrix, effectively capturing information of the transition dynamics. Building on the theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.
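The two design choices the abstract highlights, a faster-paced predictor and semi-gradient updates on the representation, can be written out in a small linear toy model. Everything below (state count, step sizes, the linear parameterization) is our own illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3                       # number of states, latent dimension

T = rng.random((n, n))            # random row-stochastic transition matrix
T /= T.sum(axis=1, keepdims=True)

Phi = rng.normal(size=(n, k))     # row i is the representation of state i
P = rng.normal(size=(k, k))       # latent-space predictor

lr_phi, lr_p = 0.02, 0.1
for _ in range(300):
    target = T @ Phi              # E[phi(s') | s], held fixed below
    # Faster-paced predictor optimization: several inner steps on P.
    for _ in range(5):
        P -= lr_p * (Phi @ P.T - target).T @ Phi / n
    # Semi-gradient step on Phi: differentiate only the prediction branch,
    # treating the target as a constant (stop-gradient).
    Phi -= lr_phi * (Phi @ P.T - target) @ P / n
```

If instead the full gradient were used (backpropagating through `target` as well) and P were updated at the same pace as Phi, the trivial solution Phi = 0 becomes much more attractive, which is the collapse the paper's analysis explains.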
Automata-based representations play an important role in control and planning in sequential decision-making, but obtaining high-level task knowledge for building automata is often difficult. Although large-scale generative language models (GLMs) can help automatically distill task knowledge, the textual outputs from GLMs are not directly utilizable in sequential decision-making. We resolve this problem by proposing a novel algorithm named GLM2FSA, which obtains high-level task knowledge, represented as a finite state automaton (FSA), from a given brief description of the task goal. GLM2FSA sends queries to a GLM for task knowledge in textual form and then builds an FSA to represent the textual knowledge. This algorithm fills the gap between text and automata-based representations, and the constructed FSA can be directly utilized in sequential decision-making. We provide examples to demonstrate how GLM2FSA constructs FSAs to represent knowledge encoded in the texts generated by the large-scale GLMs.
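The target structure, an FSA built from (state, trigger, next-state) rules extracted from text, is simple to represent. A minimal sketch; the class, rule names, and the crossing-a-road example are our own hypothetical stand-ins for what GLM2FSA would distill from a GLM's answer.

```python
from dataclasses import dataclass, field

@dataclass
class FSA:
    """A minimal finite state automaton built from (state, trigger, next) rules."""
    transitions: dict = field(default_factory=dict)
    state: str = "start"

    def add_rule(self, state, trigger, next_state):
        self.transitions[(state, trigger)] = next_state

    def step(self, trigger):
        # Unknown triggers leave the state unchanged (self-loop).
        self.state = self.transitions.get((self.state, trigger), self.state)
        return self.state

# Hypothetical rules distilled from a GLM's textual answer to
# "how do you cross a road at a crosswalk?"
fsa = FSA()
fsa.add_rule("start", "arrive_at_crosswalk", "waiting")
fsa.add_rule("waiting", "light_is_green", "crossing")
fsa.add_rule("crossing", "reach_other_side", "done")

for event in ["arrive_at_crosswalk", "light_is_green", "reach_other_side"]:
    fsa.step(event)
```

Once in this form, the automaton can drive a controller or be composed with other planning machinery, which is exactly what raw GLM text cannot do.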
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max-logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into the text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
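The consistency signal can be sketched with plain NumPy: score each image under several prompt templates (or image transformations) and flag it when the top-1 class disagrees across views. This is a simplified stand-in for the paper's measure; the shapes and the agreement statistic are our own assumptions.

```python
import numpy as np

def top1_consistency(logits_per_view):
    """Return the reference top-1 prediction and its agreement rate.

    `logits_per_view` has shape (n_views, n_classes): one row per
    text-prompt template or image transformation.
    """
    preds = logits_per_view.argmax(axis=1)
    agreement = np.mean(preds == preds[0])
    return preds[0], agreement

# Three prompt/augmentation views over four classes.
consistent = np.array([[4.0, 1.0, 0.2, 0.1],
                       [3.5, 0.9, 0.3, 0.2],
                       [4.2, 1.1, 0.1, 0.3]])
ambiguous = np.array([[2.0, 1.9, 0.2, 0.1],
                      [1.8, 2.1, 0.3, 0.2],
                      [0.5, 0.4, 2.2, 0.3]])

pred_a, agree_a = top1_consistency(consistent)   # all views agree
pred_b, agree_b = top1_consistency(ambiguous)    # views disagree: uncertain
```

Images whose agreement falls below a threshold would then be the candidates for the WordNet-based prompt augmentation described above.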
Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become necessary for audio classification. While these techniques are state-of-the-art, their effectiveness can only be guaranteed with huge computational costs and parameter counts, large amounts of data augmentation, transfer from large datasets, and other tricks. By utilizing the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network called the Paired Inverse Pyramid Structure MLP Network (PIPMN). The PIPMN reaches 96% Environmental Sound Classification (ESC) accuracy on the UrbanSound8K dataset and 93.2% Music Genre Classification (MGC) accuracy on the GTZAN dataset, with only 1 million parameters. Both results are achieved without data augmentation or model transfer. Public code is available at: https://github.com/JNAIC/PIPMN
Popular approaches to semi-supervised semantic segmentation mostly adopt a unified network model built on convolutional neural networks (CNNs) and enforce consistency of the model's predictions over small perturbations applied to the inputs or the model. However, this learning paradigm suffers from a) the limited learning capability of CNN-based models; b) a limited ability to learn discriminative features from unlabeled data; and c) limited learning of global and local information from the whole image. In this paper, we propose a novel semi-supervised learning approach, called Transformer-CNN Cohort (TCC), which consists of two students: one based on the Vision Transformer (ViT) and the other based on a CNN. Our method subtly incorporates multi-level consistency regularization on both the predictions and the heterogeneous feature spaces via pseudo-labeling for unlabeled data. First, since the inputs of the ViT student are image patches, its extracted feature maps encode crucial class-wise statistics. To this end, we propose class-aware feature consistency distillation (CFCD), which first leverages the outputs of each student as pseudo labels and generates class-aware feature (CF) maps. It then transfers knowledge between the students via the CF maps. Second, as the ViT student has more uniform representations across all layers, we propose consistency-aware cross distillation to transfer knowledge between the class-wise pixel predictions. We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it significantly outperforms existing semi-supervised methods.
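The cross-supervision idea between the two students can be sketched at the prediction level: each student is trained against the other's hard pseudo labels on unlabeled pixels. This is a simplified stand-in for TCC's distillation terms; function names, shapes, and the toy logits are our own assumptions.

```python
import numpy as np

def cross_pseudo_label_loss(logits_a, logits_b):
    """Symmetric cross supervision between two students.

    Each student's softmax cross-entropy is taken against the other
    student's argmax pseudo labels; logits have shape (n_pixels, n_classes).
    """
    def ce(logits, labels):
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    pseudo_a = logits_a.argmax(axis=1)    # e.g. the ViT student's predictions
    pseudo_b = logits_b.argmax(axis=1)    # e.g. the CNN student's predictions
    return ce(logits_a, pseudo_b) + ce(logits_b, pseudo_a)

# Two confident, agreeing students yield a small but nonzero loss.
la = np.array([[5.0, 0.0], [0.0, 5.0]])
lb = np.array([[4.0, 0.0], [0.0, 4.0]])
loss = cross_pseudo_label_loss(la, lb)
```

TCC goes further by also matching class-aware feature maps (CFCD), but the prediction-level exchange above is the backbone of the cohort training.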
Several papers rightly include minority samples in artificial intelligence (AI) training data to improve test inference for minority groups and/or society at large. Society at large consists of both minority and majority stakeholders. A common misconception is that minority inclusion does not, on its own, improve performance for the majority group. In this paper, we make the surprising finding that including minority samples can improve the test error of the majority group. In other words, minority inclusion produces majority enhancement (MIME) in performance. A theoretical existence proof of the MIME effect is presented and found to be consistent with experimental results on six different datasets. Project webpage: https://visual.ee.ucla.edu/mime.htm/
Training computer vision models usually requires collecting and labeling vast amounts of images across a variety of scene configurations and properties. This process is extremely time-consuming, and it is challenging to ensure that the distribution of the captured data maps to the target domain of the application scenario. Recently, synthetic data has emerged as a way to address both of these issues. However, existing approaches either require human experts to manually tune each scene property or use automatic methods that offer little to no control; this requires rendering large amounts of random data variations, which is slow and often suboptimal for the target domain. We introduce the first fully differentiable synthetic data pipeline, which uses Neural Radiance Fields (NeRFs) in a closed loop with the target application's loss function. Our approach generates data on demand, with no human labor, to maximize accuracy for the target task. We illustrate the effectiveness of our method on synthetic and real-world object detection tasks. We also introduce a new "YCB-in-the-Wild" dataset and benchmark, which provides a test scenario for object detection with varied poses in real-world environments.